A Closer Look at the Titles of Popular Videos¶

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

"Tokens" are usually individual words (at least in languages like English) and "tokenization" is taking a text or set of text and breaking it up into individual its words. These tokens are then used as the input for other types of analysis or tasks, like parsing (automatically tagging the syntactic relationship between words). We need to tokenize word so that we can use it for our title generating model/other cool analysis (source)

In [23]:
# simple example

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"An apple is not a banana")
for token in doc:
    print(token.text)
An
apple
is
not
a
banana
In [24]:
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
from tqdm import tqdm

punctuations = string.punctuation
stopwords = list(STOP_WORDS)
parser = English()

def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    mytokens = " ".join([i for i in mytokens])
    return mytokens


tqdm.pandas()

normal = three_countries["title"][three_countries["popular"] == 0].progress_apply(spacy_tokenizer)
popular = three_countries["title"][three_countries["popular"] == 1].progress_apply(spacy_tokenizer)
100%|██████████| 27452/27452 [00:28<00:00, 963.18it/s]
100%|██████████| 2866/2866 [00:02<00:00, 973.33it/s]
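For a rough sense of what the cleaning above produces, here is a dependency-free sketch of the same steps (lowercase, drop stopwords and punctuation, rejoin). The tiny `TOY_STOPWORDS` set is a stand-in for spaCy's `STOP_WORDS`, and the real `spacy_tokenizer` additionally lemmatizes:

```python
import string

# Stand-in for spacy.lang.en.stop_words.STOP_WORDS (assumption: a tiny subset)
TOY_STOPWORDS = {"a", "an", "the", "is", "not", "to", "of"}

def simple_tokenizer(sentence):
    # lowercase and strip each whitespace-separated token
    tokens = [w.lower().strip() for w in sentence.split()]
    # drop stopwords and punctuation, mirroring spacy_tokenizer's filter step
    tokens = [w for w in tokens
              if w not in TOY_STOPWORDS and w not in string.punctuation]
    return " ".join(tokens)

print(simple_tokenizer("An apple is not a banana"))  # -> apple banana
```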
In [25]:
# tokenize words by popularity

import nltk

def word_generator(text):
    return text.split()

def bigram_generator(text):
    bgram = list(nltk.bigrams(text.split()))
    bgram = [' '.join((a, b)) for (a, b) in bgram]
    return bgram

def trigram_generator(text):
    tgram = list(nltk.trigrams(text.split()))
    tgram = [' '.join((a, b, c)) for (a, b, c) in tgram]
    return tgram


normal_words = normal.progress_apply(word_generator)
popular_words = popular.progress_apply(word_generator)
normal_bigrams = normal.progress_apply(bigram_generator)
popular_bigrams = popular.progress_apply(bigram_generator)
normal_trigrams = normal.progress_apply(trigram_generator)
popular_trigrams = popular.progress_apply(trigram_generator)
100%|██████████| 27452/27452 [00:00<00:00, 297169.87it/s]
100%|██████████| 2866/2866 [00:00<00:00, 283571.40it/s]
100%|██████████| 27452/27452 [00:00<00:00, 52267.04it/s]
100%|██████████| 2866/2866 [00:00<00:00, 73730.06it/s]
100%|██████████| 27452/27452 [00:00<00:00, 96072.19it/s]
100%|██████████| 2866/2866 [00:00<00:00, 85867.69it/s]
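`nltk.bigrams` and `nltk.trigrams` are thin wrappers around a sliding window; the same n-grams can be produced with plain slicing, which makes the transformation easy to see:

```python
def ngrams(text, n):
    # sliding window of size n over the whitespace-split words,
    # joined back into space-separated strings
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("apple banana cherry date", 2))
# -> ['apple banana', 'banana cherry', 'cherry date']
print(ngrams("apple banana cherry date", 3))
# -> ['apple banana cherry', 'banana cherry date']
```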
In [26]:
#function that makes a pretty word frequency plot

def word_plot(words,my_color):
    slist =[]
    for x in words:
        slist.extend(x)
    fig = plt.figure(figsize=(15, 10))
    pd.Series(slist).value_counts()[:20].sort_values(ascending=True).plot(kind='barh',fontsize=20, color=my_color)
    plt.show()
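The counting inside `word_plot` (flatten the per-title token lists, then take the 20 most frequent) can be sketched without matplotlib using `collections.Counter`:

```python
from collections import Counter

def top_words(word_lists, k=20):
    # flatten a series of token lists into one frequency table,
    # as word_plot does before plotting
    counts = Counter()
    for words in word_lists:
        counts.update(words)
    return counts.most_common(k)

# toy stand-in for popular_words
print(top_words([["video", "official"], ["video", "trailer"]], k=2))
```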
In [27]:
word_plot(popular_words,'blue')
In [28]:
word_plot(popular_bigrams,'orange')
In [29]:
word_plot(popular_trigrams,'red')

TfidfVectorizer¶

Tf-idf weighs the impact of tokens (words) across the whole collection of documents. The more times a word appears in a document (each title), the more weight it gets. However, the more documents (titles) a word appears in, the more it is 'penalized' and its weight is diminished, because it is empirically less informative than features that occur in a small fraction of the training corpus (source)

  • tf(t, d) = the term frequency: the number of times term 't' appears in document 'd'
  • idf(t) = the inverse document frequency: log(n / df(t)) + 1, where n is the number of documents and df(t) is the number of documents that contain term 't' (scikit-learn's formula with smooth_idf=False)

For example, the word "banana" appears in all documents, so its idf is the lowest¶
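The idf values scikit-learn reports below can be reproduced by hand. With `smooth_idf=False`, the formula is idf(t) = ln(n / df(t)) + 1:

```python
import math

docs = ['I like banana', 'An apple is not a banana', 'banana banana oh banana']
n = len(docs)

def idf(term):
    # df(t): number of documents containing the term at least once
    df = sum(term in doc.lower().split() for doc in docs)
    # sklearn's TfidfVectorizer with smooth_idf=False
    return math.log(n / df) + 1

print(idf('banana'))  # in all 3 documents -> ln(1) + 1 = 1.0
print(idf('apple'))   # in 1 of 3 documents -> ln(3) + 1
```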

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

txt1 = ['I like banana', 'An apple is not a banana', 'banana banana oh banana']
tf = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word')
txt_fitted = tf.fit(txt1)
txt_transformed = txt_fitted.transform(txt1)
print ("The text: ", txt1)
The text:  ['I like banana', 'An apple is not a banana', 'banana banana oh banana']
In [31]:
tf.vocabulary_
Out[31]:
{'like': 4, 'banana': 2, 'an': 0, 'apple': 1, 'is': 3, 'not': 5, 'oh': 6}
In [32]:
idf = tf.idf_
print(dict(zip(txt_fitted.get_feature_names(), idf)))
print("\nThe token 'banana' appears 5 times but it is also in all documents, so its idf is the lowest")
{'an': 2.09861228866811, 'apple': 2.09861228866811, 'banana': 1.0, 'is': 2.09861228866811, 'like': 2.09861228866811, 'not': 2.09861228866811, 'oh': 2.09861228866811}

The token 'banana' appears 5 times but it is also in all documents, so its idf is the lowest
In [33]:
from sklearn.linear_model import LogisticRegression

word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    stop_words='english',
    ngram_range=(1, 1),
    max_features=10000)
word_vectorizer.fit(three_countries.title)
word_features = word_vectorizer.transform(three_countries.title)

classifier_popular = LogisticRegression(C=0.1, solver='sag')
classifier_popular.fit(word_features ,three_countries.popular)
Out[33]:
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='sag',
          tol=0.0001, verbose=0, warm_start=False)
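Because the classifier is linear, its coefficients are directly interpretable: the words with the largest coefficients push a title toward the 'popular' class. A minimal sketch of that ranking, with toy names and coefficients standing in for `word_vectorizer.get_feature_names()` and `classifier_popular.coef_[0]`:

```python
def top_features(names, coefs, k=3):
    # pair each feature with its coefficient and sort by weight, descending
    ranked = sorted(zip(names, coefs), key=lambda p: p[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# hypothetical feature names and weights for illustration
names = ["official", "trailer", "vlog", "news", "live"]
coefs = [1.2, 0.8, -0.3, -0.9, 0.1]
print(top_features(names, coefs))  # -> ['official', 'trailer', 'live']
```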
In [34]:
names = ['normal', 'popular']
In [35]:
from sklearn.pipeline import make_pipeline
from lime.lime_text import LimeTextExplainer

c_tf = make_pipeline(word_vectorizer, classifier_popular)
explainer_tf = LimeTextExplainer(class_names=names)

exp = explainer_tf.explain_instance(three_countries.title.iloc[10], c_tf.predict_proba, num_features=4, top_labels=1)
exp.show_in_notebook(text=three_countries.title.iloc[10])
In [36]:
exp = explainer_tf.explain_instance(three_countries.title.iloc[4], c_tf.predict_proba, num_features=5, top_labels=1)
exp.show_in_notebook(text=three_countries.title.iloc[4])
In [37]:
exp = explainer_tf.explain_instance(three_countries.title.iloc[10035], c_tf.predict_proba, num_features=5, top_labels=1)
exp.show_in_notebook(text=three_countries.title.iloc[10035])
In [38]:
import plotly.plotly as py
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.tools as tls


labels = list(three_countries.category_names.value_counts().index.values)
values = list(three_countries.category_names.value_counts().values)

trace = go.Pie(labels=labels, values=values)

iplot([trace], filename='basic_pie_chart')
In [39]:
three_countries.groupby('category_names')['views'].describe()
Out[39]:
count mean std min 25% 50% 75% max
category_names
Autos & Vehicles 310.0 3.188070e+05 3.020746e+05 1858.0 84825.50 272747.0 480935.00 1992282.0
Comedy 2305.0 5.366228e+05 7.103701e+05 1295.0 146885.00 335219.0 648648.00 12924004.0
Education 763.0 2.852218e+05 5.537642e+05 773.0 92603.00 179824.0 326820.50 12100921.0
Entertainment 9730.0 4.598304e+05 1.043833e+06 733.0 95177.00 211936.5 451687.25 37736281.0
Film & Animation 1431.0 4.938655e+05 1.021451e+06 943.0 74924.50 168132.0 473297.50 15969920.0
Gaming 966.0 4.992588e+05 9.087265e+05 1237.0 128074.00 233124.0 531995.25 15919643.0
Howto & Style 1780.0 3.806085e+05 6.467288e+05 1107.0 65779.00 158959.0 407440.50 8307705.0
Movies 1.0 2.255280e+05 NaN 225528.0 225528.00 225528.0 225528.00 225528.0
Music 2479.0 1.230716e+06 3.113640e+06 1591.0 137488.50 364212.0 976700.00 47669287.0
News & Politics 3414.0 2.095382e+05 3.732844e+05 549.0 47722.25 109291.0 213227.75 5943011.0
People & Blogs 3072.0 3.547916e+05 7.971334e+05 884.0 61473.25 147908.0 363105.25 20921796.0
Pets & Animals 325.0 3.150967e+05 5.431828e+05 3393.0 89108.00 172566.0 360901.00 5954189.0
Science & Technology 900.0 4.894011e+05 8.736252e+05 983.0 145697.75 320030.0 599272.00 15316016.0
Shows 106.0 5.252397e+05 2.015272e+05 36609.0 452812.25 518367.5 627156.25 997580.0
Sports 2422.0 5.219269e+05 8.680839e+05 658.0 69955.25 259285.0 605685.25 15860214.0
Travel & Events 247.0 2.338437e+05 3.018434e+05 789.0 104275.50 182617.0 252116.00 3209475.0

Latent Dirichlet Allocation (LDA) by Category¶

In [40]:
entertainment_title= three_countries["title"][(three_countries['category_names'] == 'Entertainment')] 
news_politics_title= three_countries["title"][(three_countries['category_names'] == 'News & Politics')] 
people_title= three_countries["title"][(three_countries['category_names'] == 'People & Blogs')] 
music_title= three_countries["title"][(three_countries['category_names'] == 'Music')] 
sports_title= three_countries["title"][(three_countries['category_names'] == 'Sports')] 
comedy_title= three_countries["title"][(three_countries['category_names'] == 'Comedy')] 
In [41]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis
import pyLDAvis.sklearn

vectorizer_entertainment_title = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
entertainment_title_vectorized = vectorizer_entertainment_title.fit_transform(entertainment_title)
lda_popular_entertainment_title_vectorized = LatentDirichletAllocation(n_components=7, max_iter=5, learning_method='online', verbose=True)
entertainment_title_vectorized_lda = lda_popular_entertainment_title_vectorized.fit_transform(entertainment_title_vectorized )

pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda_popular_entertainment_title_vectorized,entertainment_title_vectorized, vectorizer_entertainment_title, mds='tsne')
dash
iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
Out[41]:
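Beyond the pyLDAvis dashboard, the topics can be read directly off the fitted model: each row of `components_` is a (pseudo-count) word distribution, so the top words per topic are simply the largest entries. A sketch, with a toy matrix standing in for `lda_popular_entertainment_title_vectorized.components_`:

```python
def top_topic_words(components, feature_names, n_top=2):
    # for each topic row, rank vocabulary indices by weight, descending
    topics = []
    for row in components:
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        topics.append([feature_names[i] for i in ranked[:n_top]])
    return topics

# hypothetical vocabulary and 2-topic component matrix for illustration
feature_names = ["movie", "trailer", "show", "episode"]
components = [[5.0, 4.0, 0.1, 0.2],   # topic 0 leans film
              [0.3, 0.1, 6.0, 3.5]]   # topic 1 leans TV
print(top_topic_words(components, feature_names))
# -> [['movie', 'trailer'], ['show', 'episode']]
```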
In [42]:
vectorizer_news_politics_title = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
news_politics_title_vectorized = vectorizer_news_politics_title.fit_transform(news_politics_title)
lda_news_politics_title_vectorized= LatentDirichletAllocation(n_components=7, max_iter=5, learning_method='online',verbose=True)
news_politics_title_vectorized_lda = lda_news_politics_title_vectorized.fit_transform(news_politics_title_vectorized )

pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda_news_politics_title_vectorized,news_politics_title_vectorized, vectorizer_news_politics_title , mds='tsne')
dash
iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
Out[42]:
In [43]:
vectorizer_people_title = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
people_title_vectorized = vectorizer_people_title.fit_transform(people_title)
lda_people_title_vectorized= LatentDirichletAllocation(n_components=7, max_iter=5, learning_method='online',verbose=True)
people_title_vectorized_lda = lda_people_title_vectorized.fit_transform(people_title_vectorized )

pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda_people_title_vectorized,people_title_vectorized, vectorizer_people_title , mds='tsne')
dash
iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
Out[43]:
In [44]:
vectorizer_music_title = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
music_title_vectorized = vectorizer_music_title.fit_transform(music_title)
lda_music_title_vectorized= LatentDirichletAllocation(n_components=7, max_iter=5, learning_method='online',verbose=True)
music_title_vectorized_lda = lda_music_title_vectorized.fit_transform(music_title_vectorized )

pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda_music_title_vectorized,music_title_vectorized, vectorizer_music_title , mds='tsne')
dash
iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
Out[44]:
In [45]:
vectorizer_sports_title = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
sports_title_vectorized = vectorizer_sports_title.fit_transform(sports_title)
lda_sports_title_vectorized= LatentDirichletAllocation(n_components=7, max_iter=5, learning_method='online',verbose=True)
sports_title_vectorized_lda = lda_sports_title_vectorized.fit_transform(sports_title_vectorized )

pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda_sports_title_vectorized,sports_title_vectorized, vectorizer_sports_title , mds='tsne')
dash
iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
Out[45]:
In [46]:
vectorizer_comedy_title = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
comedy_title_vectorized = vectorizer_comedy_title.fit_transform(comedy_title)
lda_comedy_title_vectorized= LatentDirichletAllocation(n_components=7, max_iter=5, learning_method='online',verbose=True)
comedy_title_vectorized_lda = lda_comedy_title_vectorized.fit_transform(comedy_title_vectorized )

pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda_comedy_title_vectorized,comedy_title_vectorized, vectorizer_comedy_title , mds='tsne')
dash
iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
Out[46]: